Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

18.1 Transcriptomics

tissues according to their gene expression profiles; it might be inferred that tissues with the same or similar expression profile belong to the same clinical state.

If a set of experiments comprising samples prepared from cells grown under $m$ different conditions has been carried out, then the set of normalized intensities (i.e., transcript abundances) for each experiment defines a point in $m$-dimensional expression space, whose coordinates give the (normalized) degrees of expression. Distances between the points can be calculated by, for example, the Euclidean distance metric, that is,

$$
d = \left[ \sum_{i=1}^{m} (a_i - b_i)^2 \right]^{1/2}, \tag{18.1}
$$

for two samples $a$ and $b$ subjected to $m$ different conditions. Clustering algorithms (Sect. 13.2.1) can then be used to group transcripts on the basis of their similarities. The hierarchical clustering procedure is the same as that used to construct phylogenies (Sect. 17.7); that is, the closest pair of transcripts forms the first cluster, the transcript with the smallest mean distance to the first cluster forms the second cluster, and so on. This is the unweighted pair-group method with arithmetic mean (UPGMA); variants include single-linkage clustering, in which the distance between two clusters is calculated as the minimum distance between any members of the two clusters, and so on.
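The distance of Eq. (18.1) and the agglomerative procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`euclidean`, `cluster_distance`, `agglomerate`) and the toy expression profiles are assumptions introduced here, not taken from the text, and the merge loop is the generic agglomerative scheme with average linkage standing in for UPGMA and minimum linkage for the single-linkage variant.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles (Eq. 18.1)."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same number of conditions")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cluster_distance(c1, c2, profiles, linkage="average"):
    """Distance between two clusters, each given as a list of profile indices."""
    pairwise = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "average":   # UPGMA: mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    if linkage == "single":    # single linkage: closest cross-cluster pair
        return min(pairwise)
    raise ValueError("linkage must be 'average' or 'single'")

def agglomerate(profiles, linkage="average"):
    """Merge the two closest clusters repeatedly; return (cluster_a, cluster_b, distance) per merge."""
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of current clusters separated by the smallest distance.
        (p, q), d = min(
            (((p, q), cluster_distance(clusters[p], clusters[q], profiles, linkage))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return merges

# Hypothetical normalized abundances of four transcripts under m = 3 conditions.
profiles = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 4.0],
    [5.0, 5.0, 5.0],
    [9.0, 9.0, 9.0],
]
merges = agglomerate(profiles, linkage="average")
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

Recomputing every pairwise distance at each step keeps the sketch short but scales poorly; for real expression data a library routine that caches the distance matrix would normally be preferred.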

Fuzzy clustering algorithms may be more successful than the above “hard” schemes for large and complex datasets. Fuzzy schemes allow points to belong to more than one cluster. Degree of membership is defined by

$$
u_{r,s} = \frac{1}{\sum_{j=1}^{m} \left( \dfrac{d(x_r, \theta_s)}{d(x_r, \theta_j)} \right)^{1/(q-1)}}, \quad r = 1, \ldots, N; \; s = 1, \ldots, m, \tag{18.2}
$$

for $N$ points and $m$ clusters ($m$ is given at the start of the algorithm), where $d(x_i, \theta_j)$ is the distance between the point $x_i$ and the cluster represented by $\theta_j$, and $q > 1$ is the fuzzifying parameter. The cost function

$$
\sum_{i=1}^{N} \sum_{j=1}^{m} u_{i,j}^{q} \, d(x_i, \theta_j) \tag{18.3}
$$

is minimized (subject to the condition that, for each point $x_i$, the memberships $u_{i,j}$ sum to unity over the $m$ clusters) and clustering converges to cluster centres corresponding to local minima or saddle points of the cost function. The procedure is typically repeated for an increasing number of clusters until some criterion for clustering quality becomes stable; for example, the partition coefficient

$$
\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} u_{i,j}^{2}. \tag{18.4}
$$
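The fuzzy scheme of Eqs. (18.2)–(18.4) can be illustrated with a short Python sketch. Several things here are assumptions rather than statements from the text: $d(x, \theta)$ is taken to be the squared Euclidean distance to a point-like cluster centre, which is what makes the weighted-mean centre update below valid; the iteration alternates membership and centre updates in the usual fuzzy c-means manner; and all function names, the `init` parameter and the toy data are hypothetical.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance, used here as d(x, theta) (an assumption)."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def memberships(points, centres, q=2.0):
    """Degrees of membership u[r][s] following Eq. (18.2)."""
    expo = 1.0 / (q - 1.0)
    U = []
    for x in points:
        d = [sq_dist(x, c) for c in centres]
        hits = [dj == 0.0 for dj in d]
        if any(hits):
            # Point lies exactly on a centre: share full membership among those centres.
            U.append([1.0 / sum(hits) if h else 0.0 for h in hits])
        else:
            U.append([1.0 / sum((ds / dj) ** expo for dj in d) for ds in d])
    return U

def cost(points, centres, U, q=2.0):
    """Cost function of Eq. (18.3)."""
    return sum(U[i][j] ** q * sq_dist(x, c)
               for i, x in enumerate(points)
               for j, c in enumerate(centres))

def partition_coefficient(U):
    """Partition coefficient of Eq. (18.4); lies between 1/m (fully fuzzy) and 1 (hard)."""
    return sum(u * u for row in U for u in row) / len(U)

def update_centres(points, U, q=2.0):
    """Centres as means of the points weighted by u**q (valid for squared Euclidean d)."""
    dim = len(points[0])
    centres = []
    for j in range(len(U[0])):
        w = [row[j] ** q for row in U]
        total = sum(w)
        centres.append([sum(wi * x[k] for wi, x in zip(w, points)) / total
                        for k in range(dim)])
    return centres

def fuzzy_c_means(points, n_clusters, q=2.0, init=None, max_iter=200, tol=1e-10, seed=0):
    """Alternate membership and centre updates until the cost stops changing."""
    if init is not None:
        centres = init
    else:
        centres = random.Random(seed).sample(points, n_clusters)
    previous = float("inf")
    for _ in range(max_iter):
        U = memberships(points, centres, q)
        centres = update_centres(points, U, q)
        current = cost(points, centres, U, q)
        if abs(previous - current) < tol:
            break
        previous = current
    return centres, memberships(points, centres, q)

# Hypothetical data: two well-separated groups of points in two dimensions.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centres, U = fuzzy_c_means(points, 2, init=[[0.0, 0.0], [10.0, 10.0]])
pc = partition_coefficient(U)

# Repeat for an increasing number of clusters and watch the partition coefficient.
for m in (2, 3, 4):
    _, U_m = fuzzy_c_means(points, m)
    print(m, round(partition_coefficient(U_m), 3))
```

Because the result depends on the starting centres and the iteration may stop at a local minimum or saddle point, in practice one would probably run several initializations for each number of clusters before comparing partition coefficients.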